A Brazilian Portuguese language corpus development

Authors

  • Mauricio C. Schramm
  • Luis Felipe R. Freitas
  • Adriano Zanuz
  • Dante Barone
Abstract

This article presents the techniques being used to create a speech database for the Brazilian Portuguese language. The database consists of recorded voices from different speakers and different regions of Brazil, and the recordings contain varied phonetic and phonological information. Its applications are diverse, including speech synthesis and recognition systems and data for linguistic studies. The corpus is composed of read sentences in Brazilian Portuguese, similar to sentences found in the TIMIT corpus, as well as answers to questions such as the speaker's name, address, telephone number, ZIP code, and other information. The data were recorded at 44 kHz with a direct connection from the microphone to the sound card. The corpus currently contains data from about 200 speakers, and future development efforts will expand it to 1000 speakers. The paper covers in some detail the protocol used to design the corpus and the methods of data collection. An HMM/ANN hybrid continuous-digit recognizer developed using a small subset of this corpus has 96.18% word-level accuracy and 78.95% sentence-level accuracy. This recognizer was trained on 48 files, developed using 11 files, and tested on 19 files, with an average of 5 digits per file. A total of 103 context-dependent categories were used in training. A general-purpose recognizer capable of recognizing arbitrary words is currently under development. This work is part of the Spoltech Project, a computational linguistics research project that aims to create, develop and improve speech synthesis and recognition technologies.
The interdisciplinary project brings together researchers, teachers and students of the Instituto de Informática and Instituto de Letras (Language and Literature College) of the Universidade Federal do Rio Grande do Sul, the Departamento de Informática of the Universidade de Caxias do Sul, CSLR/CU (University of Colorado, Boulder) and CSLU/OGI (Oregon Graduate Institute).
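
The abstract reports word-level and sentence-level accuracy but does not say how they were scored. The sketch below assumes the common convention: word accuracy is (N - errors) / N, with errors counted by edit distance between reference and recognized digit strings, and sentence accuracy is the fraction of files recognized with no error at all. The function names and the two sample utterances are hypothetical and are not taken from the Spoltech corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists
    (counts substitutions, deletions and insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def score(pairs):
    """Return (word accuracy %, sentence accuracy %) for a list of
    (reference tokens, hypothesis tokens) pairs. Assumed scoring
    convention, not necessarily the one used in the paper."""
    total_words = sum(len(ref) for ref, _ in pairs)
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    correct_sentences = sum(1 for ref, hyp in pairs if ref == hyp)
    word_acc = 100.0 * (total_words - total_errors) / total_words
    sent_acc = 100.0 * correct_sentences / len(pairs)
    return word_acc, sent_acc


# Hypothetical digit strings, made up for illustration only.
pairs = [
    ("um dois tres quatro cinco".split(), "um dois tres quatro cinco".split()),
    ("seis sete oito nove zero".split(), "seis sete oito zero".split()),  # one digit dropped
]
word_acc, sent_acc = score(pairs)
print(f"word accuracy: {word_acc:.2f}%  sentence accuracy: {sent_acc:.2f}%")
```

Under this convention a single wrong digit fails the whole file, which is why sentence-level accuracy sits well below word-level accuracy. The reported 78.95% would be consistent with 15 of the 19 test files being fully correct (15/19 ≈ 0.7895), although the abstract does not state that count.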

Similar resources

The Brazilian Portuguese Lexicon: An Instrument for Psycholinguistic Research

In this article, we present the Brazilian Portuguese Lexicon, a new word-based corpus for psycholinguistic and computational linguistic research in Brazilian Portuguese. We describe the corpus development, the specific characteristics on the internet site and database for user access. We also perform distributional analyses of the corpus and comparisons to other current databases. Our main obje...

UNITEX-PB, a set of flexible language resources for Brazilian Portuguese

This work documents the project and development of various computational linguistic resources that support the Brazilian Portuguese language according to the formal methodology used by the corpus processing system called UNITEX. The delivered resources include computational lexicons, libraries to access compressed lexicons, and additional tools to validate those resources.

‘Minor’ Languages, ‘Broken’ Translations: On Brazilian Reworkings of an Albanian Novel

This essay approaches the challenges of global translation in the 21st century from what might still be considered a somewhat uncommon example: a direct translation of Ismail Kadaré's 1978 novel Prill e thyër (Broken April) from the original Albanian into Brazilian Portuguese in 2001. Not only does it examine and compare lexical elements in the source and target texts and the usage of translato...

Propbank-Br: a Brazilian Treebank annotated with semantic role labels

This paper reports the annotation of a Brazilian Portuguese Treebank with semantic role labels following Propbank guidelines. A different language and a different parser output impact the task and require some decisions on how to annotate the corpus. Therefore, a new annotation guide – called Propbank-Br – has been generated to deal with specific language phenomena and parser problems. In this ph...

The Presence and Influence of English in the Portuguese Financial Media

As the lingua franca of the 21st century, English has become the main language for intercultural communication for those wanting to embrace globalization. In Portugal, it is the second language of most public and private domains influencing its culture and discourses. Language contact situations transform languages by the incorporations they make from other languages and Portugal has...

Baseline Acoustic Models for Brazilian Portuguese Using CMU Sphinx Tools

Advances in speech processing research rely on the availability of public resources such as corpora, statistical models and baseline systems. In contrast to languages such as English, there are few specific resources for Brazilian Portuguese. This work describes efforts aiming to decrease such gap. Baseline acoustic models for Brazilian Portuguese were built using the CMU Sphinx toolkit and pub...

Publication date: 2000